<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Design of ustr, the micro string API - for C</title>
    <link rel="stylesheet" type="text/css" href="http://www.and.org/f_c-1.0.css"

    <style>
      body { width 85% };

      A       { background: #FFF; color: #44F; }
      A:hover {                   color: #20b2aa; }

      A.anchor       { color: #000; }
      A.anchor:hover { color: #000; }

      P { text-indent: 2em; }

      ul li                       { list-style-type: lower-roman; }
    </style>

  </head>

  <body>
    <h1>Design of ustr, the micro string API - for C</h1>

<ul>
<li><a href="#nonmagic">Non-Magic version of a ustr</a> </li>
<li><a href="#magic">Magic/real version of ustr</a> </li>
<li><a href="#unused">Unused bit patterns for a ustr</a> </li>
<li><a href="#english">English description of struct Ustr</a> </li>
<li><a href="#possibilities">Number of options for strings</a> </li>
</ul>

<h2 id="nonmagic"> Non-Magic version of a ustr </h2>

<pre class="c2html">
<span class="comment">/* As the saying goes:
 *   "you can tell a lot about the code from the data structure".
 *
 * So this is the non-magic version of struct Ustr:
 *
 * struct Ustr
 * {
 *   const char *ptr;
 *   size_t reference_count;
 *   size_t length;
 *   size_t size;
 *
 *   unsigned char flag_is_readonly            : 1;
 *   unsigned char flag_is_fixed               : 1;
 *   unsigned char flag_exact_byte_allocations : 1;
 *   unsigned char flag_has_enomem_error       : 1;
 *   unsigned char flag_fail_on_alloc          : 1;
 * };
 *
 * ...however this "eats" memory for "small" strings, which is one reason given
 * why people don't use a real string ADT.
 *
 */</span>
</pre>

<h2 id="magic"> Magic/real version of ustr </h2>

<pre class="c2html">
struct Ustr 
{
 <span class="unsigned">unsigned</span> <span class="char">char</span> data[1];
 <span class="comment">/* 0b_wxzy_nnoo =&gt;
  * 
  *    w         = allocated
  *     x        = has size
  *      y       = round up allocations (off == exact allocations)
  *       z      = memory error
  *         nn   = refn
  *           oo = lenn
  *
  * Eg.
  *
  * 0b_0000_0000 ==
  *           "" ==
  * not allocated, [no size], [no round up allocs], no mem err, refn=0, lenn=0
  *
  * 0b1000_0001 ==
  *        0xA5 ==
  * allocated, no size, no round up allocs,         no mem err, refn=0, lenn=1
  *
  * 0b1010_0101 ==
  *        0xA5 ==
  * allocated, no size,    round up allocs,         no mem err, refn=1, lenn=1
  */</span>
};

struct Ustrp
{ <span class="comment">/* This is just used as a way to catch errors at compile time with static
   * typing. You might think we'd define Ustrp to be identical to Ustr, and
   * just cast however aliasing rules screw us over if we do that. */</span>
  struct Ustr s;
};
</pre>

<h2 id="unused"> Unused bit patterns for a ustr </h2>

<pre class="c2html">
<span class="comment">/* Unused bit patterns for data[0]:
 *
 * 0bx0xx_x100 == lenn = 0
 * 0bx0xx_1x00
 * 0bx0x1_xx00
 * 0bx01x_xx00
 *              0bx1xx_xx00 =&gt; is used in sized
 * 0b10xx_xx00
 *
 * 0bx1xx_xx11 == 128bit size_t, yeh!
 * 0bx1xx_11xx
 *
 * 0b0xxx_x1xx == refn values for const/fixed
 * 0b0xxx_1xxx
 *
 * 0b00x1_xxxx == mem error for const
 * 0b001x_xxxx == not exact for const
 *
 */</span>
</pre>

<h2 id="english">English description of struct Ustr </h2>

<p>
  This strucure is very magic, as an optimization "" is a valid "structure"
 for an empty string, ustr_len("") == 0, ustr_size("") == 0,
 ustr_cstr("") == "".
</p><p>
  This also works for other "static data strings" if they start with the first
 four bits as 0 ... although for C const strings using this, the length has
 to be manually defined correctly at compile time.
</p><p>
  However it is also a "space efficient" String ADT, with a managed length,
 _either_ with or without reference counts (so that ustr_dup() doesn't copy)
 and with or without a stored size (so growing/shrinking a lot doesn't
 cause lots of malloc/realloc usage). Note that for lengths of strings, or
 reference counts that need &gt;= 8 bytes of storage the Ustr <b>must</b>
 contain a stored size.
</p><p>
  It can also track memory allocation errors over multiple function calls.
</p><br><p>
  If the first byte == 0, then it's "" as above.
</p><p>
  Otherwise the first byte contains four flags, and 2 numbers (2 bits each).
</p><p>
  The 1st flag says if the string is allocated. The 2nd flag says if it has a
 stored size. The 3rd flag says if it isn't using "exact allocations", if it
 is ustr_len() + ustr_overhead() == ustr_size_alloc(), otherwise there might
 be room to grow without doing a memory allocation call[1]. The 4th flag says
 if this Ustr has had a memory allocation error.
 The two numbers are the mappings for the length of the reference count, and
 the length of the length. Note that this mapping changes if the 2nd flag is
 set.
</p><p>
 [1] There is an implied "size" of the string, based an how big it is. This
 is a concession for speed efficiency, although it only grows at 1.5x
 not 2x which is the norm. (again, to reduce memory waste). Again if this is
 too much overhead, just use exact sized allocations.
</p><p>

  Also <b>NOTE</b> if the 1st and 2nd flags are set, this means that the Ustr
 is in fixed storge (like an auto array). Also if the 1st, 2nd and 3rd flags are
 set, this means that the Ustr is limited, Ie. it's in fixed storge and
 <b>cannot</b> grow (all allocation attempts will fail).
</p>

<h3>
   If there is any more data after the above declared one they have
     been allocated via. the "struct hack" method (search for more info).
</h3>

<p>
  Next, possibly, comes the reference count (of the given length[2]).
</p><p>
  Next, if not "", comes the length of the data (of the given length[2]).
</p><p>
  Next, if non-zero length, comes the data, which can include NIL bytes.
</p><p>
  And finally, if not "", a NIL byte, to make sure it's always a valid C-Style
 string (although obviously any embeded NIL bytes will make it look shorter
 if something else just uses strlen(ustr_cstr())).
</p><p>

 [2] The sizes can currently be 1, 2, 4 or 8 bytes. Depending on the mapping
 of the value in the first byte (length changes dynamically, as you add data).
</p>
<h3>
 Examples
</h3>

<p>
  The allocated string "a", using exact allocations and no reference
 counting, is the 4 bytes:
</p>

<pre class="c2html">
     {0x81, 0x01, 'a', 0x00}
</pre>

<p>
  The allocated string "ab", using non-exact allocations and 1 byte reference
 counting (shared by 13 users), is the 6 bytes (there is no waste due to
 non-exact allocations):
</p>

<pre class="c2html">
     {0xA5, 0x0D, 0x02, 'a', 'b', 0x00}
</pre>

<p>
  The allocated string "abcdefghijklmnop" (16 bytes, and a NIL), with
 non-exact allocations 2 byte reference counting (shared by 1,003 users), is
 the 24 bytes (with 3 bytes of "unused" space at the end):
</p>

<pre class="c2html">
     {0xA9, 0x03, 0xEB, 0x10, 'a', 'b', [...], 'o', 'p', 0x00, &lt;x&gt;, &lt;x&gt;, &lt;x&gt;}
</pre>

<h2 id="possibilities"> Number of options for strings </h2>

<p> This design allows 83 different combinations of options for your string
 types:
</p>

<ul>
<li> <b>Constant/read-only strings.</b>: Ie. 
<pre class="c2html">
Ustr *s1 = USTR(<span class="str">""</span>);
Ustr *s2 = USTR1(\x4, <span class="str">"abcd"</span>);
</pre>
<!-- C to html convertion of stdin -->
<!--   done on Wed May 23 19:18:26 2007 -->
</li>
<li> <b>Fixed/auto sized strings.</b>: These can seamlessly convert into
allocated strings, if you add too much data. Ie. 
<pre class="c2html">
  <span class="char">char</span> buf_s3[USTR_SIZE_FIXED(128)] = <span class="str">"x112233abcd"</span>;
  <span class="comment">/*                  x       = the info byte
                       11     = reference count storage - always two bytes
                         22   = len storage  - always same size as the size
                           33 = size storage - uses 2, 4 or 8 bytes depending on sizeof() */</span>
  Ustr *s3 = USTR_SC_INIT_AUTO(buf_s3, USTR_FALSE, 4); <span class="comment">/* s3 is now the string "abcd" */</span>
</pre>
<!-- C to html convertion of stdin -->
<!--   done on Wed May 23 19:19:30 2007 -->
</li>
<li> <b>Fixed/auto sized limited strings.</b>: These will fail to go beyond their specified size (this failure can be checked with ustr_enomem(), just like allocated strings). Ie. 
<pre class="c2html">
  <span class="char">char</span> buf_s3[USTR_SIZE_FIXED(32)] = USTR_BEG_FIXED2 <span class="str">"abcd"</span>;
  Ustr *s3 = USTR_SC_INIT_AUTO(buf_s3, USTR_TRUE, 4); <span class="comment">/* s3 is now the string "abcd" */</span>
</pre>
<!-- C to html convertion of stdin -->
<!--   done on Wed May 23 19:19:30 2007 -->
</li>
<li> <b>Allocated, no reference counts</b>: This is the lowest overhead,
 meaning that ustr_dup() etc. will always have to create a new string and do a
 copy. <i>For this loss of functionality you gain a single byte</i>. Although
 if you never share the strings and have very small string lengths this
 <i>could</i> be significant (0% overhead for an 8 byte string, total overhead
 being 37.5% or 50% with allocation rounding). </li>
<li> <b>Allocated, uint8_t reference counts</b>: If you anticipate a low
 (1-255 users) amount of sharing, and a very low average string size. You can
 allocate just a single byte to the reference count (less than 10% overhead for
 an 8 byte string, total overhead being 50%).
 </li>
<li> <b>Allocated, uint16_t reference counts</b>: This is what I'd consider
 "normal" usage, it allows a string to be duplicated at no cost upto 65,535
 times, and the reference count is only going to take up 2 bytes (less than 20%
 overhead for an 8 byte string, total overhead being 62.5% or 100% with
 allocation rounding). <i>Note this is the default allocation policy</i></li>
<li> <b>Allocated, uint32_t reference counts</b>: This might be useful if you
 want to share the string in a <i>lot</i> of places, it allows a string to be
 duplicated at no cost upto 4,294,967,295 times, but the reference count is
 going to take up 4 bytes (less than 40% overhead for an 8 byte string, total
 overhead being 87.5% or 100% with allocation rounding).</li>
<li> <b>Allocated, uint64_t reference counts</b>: This is available, allowing
 a string to be duplicated at no cost upto 18,446,744,073,709,551,615 times, 
 but the reference count is going to take up 8 bytes (less than 100% overhead
 for an 8 byte string, total overhead being 137.5% or 200% with allocation
 rounding).
<i>Note that this is only allowed on a 64bit machine.</i></li>

<li> <b>Allocated, infinite reference count</b>: If you have an allocates string
 which has any size of reference count you can call ustr_set_share() and the
 string will allow an infinite amount of duplication, but can never be free'd
 no matter how many times someone calls ustr_free(). The string can also not be
 altered anymore. Calling ustr_set_owner() returns the reference count to 1, 
 and calling ustr_free_shared() will free the string.
</li>

<li> <b>Pool allocated</b>: If you use the ustrp_*() calls instead of
 ustr_*() calls, you will get strings allocated from a "pool" of memory which
 can be free'd at once (note that ustrp_free() probably does nothing, in this
 case). </li>

<li> <b>Allocated, with explicit size</b>: For slightly extra overhead you can
 have explicitly sized allocated strings, these will have ustr_size() expand as
 needed ... but they will <b>not</b> implicitly reclaim that memory (you have
 to call ustr_realloc()). They are useful if you need to add significant amounts
 of data and then delete it before adding more. </li>

<li> <b>Allocated, with exact size</b>: For slightly less extra overhead you can
 have exactly sized allocated strings, these do not add any extra space when
 the strings need to be resized. Thus if you need to store a 4 byte string,
 using uint16_t reference counts and no stored size, you will allocate 9 bytes
 instead of 12 bytes. However to then add a single byte the system allocation
 functions will need to be called. </li>

</ul>

    <hr>
    <address><a href="mailto:james-web@and.org">James Antill</a></address>
<!-- Created: Wed May 23 19:10:32 EDT 2007 -->
<!-- hhmts start -->
Last modified: Tue Aug 28 01:06:50 EDT 2007
<!-- hhmts end -->
  </body>
</html>
